Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries

Authors

  • Chang Liu
  • Hwee Tou Ng
Abstract

In this work, we introduce the TESLA-CELAB metric (Translation Evaluation of Sentences with Linear-programming-based Analysis – Character-level Evaluation for Languages with Ambiguous word Boundaries) for automatic machine translation evaluation. For languages such as Chinese where words usually have meaningful internal structure and word boundaries are often fuzzy, TESLA-CELAB acknowledges the advantage of character-level evaluation over word-level evaluation. By reformulating the problem in the linear programming framework, TESLA-CELAB addresses several drawbacks of the character-level metrics, in particular the modeling of synonyms spanning multiple characters. We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.
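
To make the baseline concrete, the following is a minimal sketch of character-level modified n-gram precision, the core quantity behind character-level BLEU. It is not TESLA-CELAB and omits BLEU's brevity penalty and geometric averaging; the function names `char_ngrams` and `char_ngram_precision` and the example sentences are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace so segmentation does not matter."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def char_ngram_precision(candidate, reference, n):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand = char_ngrams(candidate, n)
    ref = char_ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical example: the same sentence with and without word segmentation.
same_sentence = char_ngram_precision("我喜欢猫", "我 喜欢 猫", 2)

# Hypothetical example: one word replaced by a different word of different length.
different_word = char_ngram_precision("我爱猫", "我喜欢猫", 1)
```

Because whitespace is discarded, the score does not depend on how either sentence is segmented into words. Exact character matching also gives no credit when a multi-character word is replaced by a different word with a similar meaning, which is the kind of drawback the abstract says TESLA-CELAB is designed to address.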

Similar resources

Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Maced...

A Character-Aware Encoder for Neural Machine Translation

This article proposes a novel character-aware neural machine translation (NMT) model that views the input sequences as sequences of characters rather than words. On the use of row convolution (Amodei et al., 2015), the encoder of the proposed model composes word-level information from the input sequences of characters automatically. Since our model doesn’t rely on the boundaries between each wo...

A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level ...

Character-based Neural Machine Translation

We introduce a neural machine translation model that views the input and output sentences as sequences of characters rather than words. Since word-level information provides a crucial source of bias, our input model composes representations of character sequences into representations of words (as determined by whitespace boundaries), and then these are translated using a joint attention/transla...

BLEU in Characters: Towards Automatic MT Evaluation in Languages without Word Delimiters

Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU or NIST, are now well established. Yet, they are scarcely used for the assessment of language pairs like English-Chinese or English-Japanese, because of the word segmentation problem. This study establishes the equivalence between the standard use of BLEU in word n-grams and its application at the character level. T...

Publication date: 2012